"Will it snow tomorrow?" - The time traveler asked

The following dataset contains climate information from over 9000 stations across the world. The overall goal of these subtasks is to predict whether it will snow tomorrow, 11 years ago. So if today is 2021-02-15, then the weather we want to forecast is for 2010-02-16. You are supposed to solve the tasks using BigQuery, which can be used in a Jupyter Notebook as shown in the following cell. For further information on how to use BigQuery in a Jupyter Notebook, refer to the Google docs.

The goal of this test is to assess your coding knowledge in Python, BigQuery and Pandas, as well as your understanding of data science. If you get stuck in the first part, you can use the replacement data provided in the second part.

Part 1

1. Task

Change the date format to 'YYYY-MM-DD', select the data from 2006 through 2010 for station numbers between 725300 and 726300 (inclusive), and save it as a pandas DataFrame. Note that the maximum year available is 2010.

Comments: tests simple logical conditioning, knowledge of SQL syntax, and the ability to find in the docs how to store a query result as a DataFrame variable.
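A query for Task 1 might be sketched as follows. The table name is an assumption (the public `bigquery-public-data.samples.gsod` sample table is used here); replace it with the dataset you were actually given.

```python
# Hypothetical sketch for Task 1: build the date from the year/month/day
# columns and filter on station range and years. Table name is assumed.
query = """
SELECT
  FORMAT_DATE('%Y-%m-%d', DATE(year, month, day)) AS date,
  *
FROM `bigquery-public-data.samples.gsod`
WHERE station_number BETWEEN 725300 AND 726300
  AND year BETWEEN 2006 AND 2010
"""

# In an authenticated notebook with google-cloud-bigquery installed:
# from google.cloud import bigquery
# client = bigquery.Client()
# df = client.query(query).to_dataframe()
```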

2. Task

From here on we want to work with the data from all stations that have information from 2005 through 2010. Select the relevant data.

Because the dataset without this limitation had more than 17,000,000 rows, I decided to limit the data to station numbers between 725200 and 726400.
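One way to sketch this selection is to keep only stations that report in every year from 2005 through 2010; again the table name is an assumption and should match the dataset used in Task 1.

```python
# Hypothetical sketch for Task 2: a station qualifies if it has rows in
# all six years 2005-2010. Table name and station range as assumed above.
query = """
SELECT *
FROM `bigquery-public-data.samples.gsod`
WHERE year BETWEEN 2005 AND 2010
  AND station_number BETWEEN 725200 AND 726400
  AND station_number IN (
    SELECT station_number
    FROM `bigquery-public-data.samples.gsod`
    WHERE year BETWEEN 2005 AND 2010
    GROUP BY station_number
    HAVING COUNT(DISTINCT year) = 6
  )
"""
# df = client.query(query).to_dataframe()  # as in Task 1
```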

Data exploration

The DataFrame has 31 columns and 488,117 rows. The data types in the DataFrame are bool, datetime64[ns], float64, int64 and object. There are also missing values in the DataFrame.
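These checks can be reproduced with a few pandas calls; the tiny synthetic frame below only stands in for the real one, to show the calls themselves.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real DataFrame, just to illustrate the checks.
df = pd.DataFrame({
    "station_number": [725300, 725300, 726000],
    "mean_temp": [31.2, np.nan, 28.4],
    "snow": [True, False, True],
})

print(df.shape)            # (rows, columns)
print(df.dtypes)           # one dtype per column
missing = df.isna().sum()  # missing values per column
print(missing)
```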

As a first step, I decided to drop the following columns:

The above plot shows that we have:

We see that the dataset is unbalanced. This information will be important when we choose metrics to evaluate our model.
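The class balance can be checked directly on the label column; the series below is synthetic and only illustrates the call.

```python
import pandas as pd

# Synthetic label column standing in for the real snow flag.
snow = pd.Series([False] * 9 + [True] * 1, name="snow")

# Share of snow vs. no-snow days; a large skew means an unbalanced dataset.
counts = snow.value_counts(normalize=True)
print(counts)
```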

Preprocessing and feature engineering

Do a first analysis of the remaining dataset, clean or drop data depending on how you see appropriate.

We see that one station can have more than one wban_number. To simplify the analysis, I will drop the wban_number column. I also decided to drop max_temp_explicit and tornado.

We still have NaN values in five columns. I will use ffill() to fill these values.
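A minimal sketch of forward filling: each NaN is replaced by the last valid observation before it. On the real data this would be applied per station (e.g. `df.groupby("station_number").ffill()`, assuming the frame is sorted by date) so values never leak between stations.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# ffill propagates the last valid value forward into the gaps.
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```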

Add features

I'm lagging the features 'snow' and 'mean_temp', on the assumption that these are important features.

The data now has no NaN values, and the new columns were added.
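The lagging step above can be sketched as follows; column names are taken from the text, and lagging is done per station so that yesterday's values never leak across station boundaries.

```python
import pandas as pd

df = pd.DataFrame({
    "station_number": [1, 1, 1, 2, 2, 2],
    "snow":      [False, True, False, True, True, False],
    "mean_temp": [30.0, 28.0, 33.0, 20.0, 19.0, 25.0],
})

# Shift each feature by one day within each station's own history.
for col in ["snow", "mean_temp"]:
    df[f"{col}_lag1"] = df.groupby("station_number")[col].shift(1)

# The first day of each station has no previous day, so its lags are NaN.
df = df.dropna()
print(df)
```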

3. Task

Split data
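A minimal sketch of a chronological split, holding out the final available year (2010) as the test set; the column names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2006-01-01", "2008-06-01",
                            "2010-03-01", "2010-07-01"]),
    "snow": [0, 1, 0, 1],
})

# Time-ordered split: everything before 2010 for training,
# the last year held out for the final test evaluation.
train = df[df["date"].dt.year < 2010]
test = df[df["date"].dt.year == 2010]
print(len(train), len(test))
```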

Part 2 - Modeling

For modeling I decided to use RandomForestClassifier. For cross-validation I used TimeSeriesSplit, which works like an expanding window along the time dimension.
For example, I trained on the data for 2006 and validated on 2007; in the next fold I trained on the data from 2006 and 2007 and validated on 2008, and so on (see the plots below). Accuracy, precision and recall were calculated for the train, validation and test sets.
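The expanding-window scheme can be sketched with scikit-learn as below; the features and label here are random stand-ins, so the printed scores are not meaningful, only the fold structure is.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                               # stand-in, time-ordered features
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)  # stand-in label

tscv = TimeSeriesSplit(n_splits=4)  # each fold trains on a longer prefix
train_sizes = []
for train_idx, valid_idx in tscv.split(X):
    train_sizes.append(len(train_idx))
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[valid_idx])
    print(len(train_idx),
          round(accuracy_score(y[valid_idx], pred), 2),
          round(precision_score(y[valid_idx], pred, zero_division=0), 2),
          round(recall_score(y[valid_idx], pred, zero_division=0), 2))
```

Each fold's training window strictly contains the previous one, which is exactly the expanding-window behaviour described above.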

A baseline model predicts tomorrow's snow based only on whether it is snowing today.
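Such a baseline might be sketched like this, on a tiny illustrative series (0 = no snow, 1 = snow); the column names are assumptions.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

df = pd.DataFrame({"snow": [0, 1, 1, 0, 0, 1, 0, 0]})

# Tomorrow's label is today's column shifted back by one day;
# the last day has no tomorrow, so it is dropped.
df["snow_tomorrow"] = df["snow"].shift(-1)
df = df.dropna()

# Baseline: predict snow tomorrow iff it snows today.
y_true = df["snow_tomorrow"].astype(int)
y_pred = df["snow"]
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(prec, rec)
```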

I calculated precision and recall because the dataset is unbalanced: we have more no-snow days than snow days.

Modeling for a separate station

Modeling for every station, and weather prediction for tomorrow.

Summary

What could lead to better results?